Visualisation with ggplot

Etienne Côme

October 26, 2020

Visualisation ?

“Transformation of the symbolic into the geometric”
[McCormick et al. 1987]

“… finding the artificial memory that best supports our natural means of perception.”
[Bertin 1967]

“The use of computer-generated, interactive, visual representations of data to amplify cognition.”
[Card, Mackinlay, & Shneiderman 1999]

Why visualize?

Integrating the human in the loop

  • Answering questions or finding questions?
  • Making decisions
  • Putting data in context
  • Amplifying memory
  • Graphical computation
  • Finding patterns and structures
  • Presenting arguments

Why visualize?

Analyze

  • Developing and criticizing hypotheses
  • Discovering errors
  • Finding patterns

Communicate

  • Sharing and convincing
  • Collaborating and reviewing

Anscombe quartet

g  mean_x  mean_y    sd_x      sd_y
1  9       7.500909  3.316625  2.031568
2  9       7.500909  3.316625  2.031657
3  9       7.500000  3.316625  2.030424
4  9       7.500909  3.316625  2.030578
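
A summary table of this kind can be recomputed from the `anscombe` data frame that ships with base R. The following is a minimal sketch; the name `quartet_stats` is ours and does not come from the slides.

```r
# Summary statistics for the four Anscombe data sets.
# `anscombe` ships with base R (package datasets) as columns x1..x4, y1..y4.
quartet_stats <- do.call(rbind, lapply(1:4, function(g) {
  x <- anscombe[[paste0("x", g)]]
  y <- anscombe[[paste0("y", g)]]
  data.frame(g = g,
             mean_x = mean(x), mean_y = mean(y),
             sd_x = sd(x), sd_y = sd(y))
}))
quartet_stats
```

The four groups share almost identical means and standard deviations, which is why only a plot reveals how different they are.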

Cholera map (John Snow)

Visualization = encoding the data using visual channels

Visual channels

Bertin Jacques, Sémiologie graphique, Paris, Mouton/Gauthier-Villars, 1967.

Marks // visual channels

Marks: graphical building blocks

Visual channels: the visual properties that vary

Marks, visual channels

Not all channels are equal

Marks, visual channels

The best channels depend on the feature type (continuous, categorical, ordinal,…)

Marks, visual channels

The interesting part is not already available

pre-attentive processing

How many 3s?

1281768756138976546984506985604982826762
9809858458224509856458945098450980943585
9091030209905959595772564675050678904567
8845789809821677654876364908560912949686

Questions? Feature types?

continuous? discrete? ordinal? temporal? spatial?

Some categories and one quantity for each category

The bar chart

dataset mpg

  • manufacturer
  • model
  • displ: engine displacement, in litres

The bar chart

Order ?

Horizontal ?
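
As an illustration of the two questions above, here is a minimal sketch of a bar chart on `mpg` with bars sorted by value and drawn horizontally. Showing the mean of `displ` per manufacturer is our assumption; the slides do not say which quantity their chart uses.

```r
library(ggplot2)

# One quantity per category: mean engine displacement per manufacturer.
displ_by_manufacturer <- aggregate(displ ~ manufacturer, data = mpg, FUN = mean)

# reorder() sorts the bars by value; coord_flip() makes them horizontal
# so that the manufacturer names stay readable.
p <- ggplot(displ_by_manufacturer,
            aes(x = reorder(manufacturer, displ), y = displ)) +
  geom_col() +
  coord_flip() +
  labs(x = "manufacturer", y = "mean engine displacement (litres)")
p
```

Sorting by value only makes sense when the categories have no natural order of their own.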

The line chart: 1 numeric variable with respect to time

Vélib’ data :

Time, natural order

Aspect ratio

Aspect ratio, 45°

Heuristic: use the aspect ratio that results in an average line slope of 45°.

Cleveland, William S., Marylyn E. McGill, and Robert McGill. “The shape parameter of a two-variable graph.” Journal of the American Statistical Association 83.402 (1988): 289-300.
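
A rough way to apply the 45° heuristic is sketched below. It is a simplified variant that uses the median absolute slope of the segments, not the full procedure from the paper, and the helper name `bank_ratio` is hypothetical.

```r
# Aspect ratio (drawn length of one y unit / drawn length of one x unit)
# for which the median absolute segment slope is displayed at 45 degrees.
# Simplified variant of banking, not Cleveland's full procedure.
bank_ratio <- function(x, y) {
  slopes <- abs(diff(y) / diff(x))
  slopes <- slopes[is.finite(slopes) & slopes > 0]
  1 / median(slopes)
}

# Toy series whose segments all have slope 2: one y unit must be drawn
# half as long as one x unit for the segments to appear at 45 degrees.
bank_ratio(x = 1:5, y = 2 * (1:5))
```

In ggplot2 the result can be passed to `coord_fixed(ratio = ...)`, whose `ratio` argument has the same meaning (y unit length over x unit length).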

Area + Scale

Point of view

1 numeric variable with respect to time + categories

Vélib’ data per station

Multiple line charts

Small multiples
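
The two options can be compared with a minimal sketch. The Vélib’ data are not reproduced here, so `economics_long` (shipped with ggplot2) stands in for the stations; that substitution is ours.

```r
library(ggplot2)

# Stand-in data: each level of `variable` plays the role of one station;
# value01 is the series rescaled to [0, 1].

# Multiple line chart: all series in one panel, distinguished by color.
p_lines <- ggplot(economics_long,
                  aes(x = date, y = value01, color = variable)) +
  geom_line()

# Small multiples: one panel per series, all sharing the same axes.
p_multiples <- ggplot(economics_long, aes(x = date, y = value01)) +
  geom_line() +
  facet_wrap(~ variable)
```

Small multiples tend to stay readable when the number of series grows, at the cost of making direct comparison between two series harder.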

2 numeric features + categories

Scatter plot + colors

Scatter plot + symbols
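
Both encodings can be sketched on `mpg`. The choice of `displ`, `hwy` and `drv` is ours, for illustration only.

```r
library(ggplot2)

# Category encoded with the color channel.
p_color <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point()

# Same data, category encoded with the shape channel.
p_shape <- ggplot(mpg, aes(x = displ, y = hwy, shape = drv)) +
  geom_point()
```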

3 numeric features (one of them > 0) + categories

Scatter plot + color + size

Scatter plot + color + size ! scales

Circle size: radius or area?

Radius

Area
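
In ggplot2 the two options correspond to two different size scales. The sketch below uses a made-up data frame `d` to contrast them.

```r
library(ggplot2)

# Made-up values for illustration: the largest is 16 times the smallest.
d <- data.frame(x = 1:4, y = 1, value = c(1, 4, 9, 16))

# Area proportional to the value (a value of 0 gives a point of size 0).
p_area <- ggplot(d, aes(x, y, size = value)) +
  geom_point() +
  scale_size_area(max_size = 20)

# Radius mapped linearly to the value: areas then grow much faster
# than the data, which exaggerates the differences.
p_radius <- ggplot(d, aes(x, y, size = value)) +
  geom_point() +
  scale_radius(range = c(1, 20))
```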

Principle :

\[\textrm{Lie factor} = \frac{\textrm{visual effect size}}{\textrm{data effect size}}\]

Lie factor :

\[\textrm{data effect size} = \frac{27.5 - 18}{18} \times 100 = 53 \%\]

Edward Tufte, The Visual Display of Quantitative Information, 2nd ed., Cheshire, CT, Graphics Press, 2001 (1st ed. 1983)

Lie factor :

\[\textrm{visual effect size} = \frac{5.3 -0.6}{0.6} \times 100 = 783 \%\]

Lie factor :

\[\textrm{Lie factor} = \frac{783}{53} = 14.8\]
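
The same arithmetic can be checked in a few lines. This is a minimal sketch; the helper name `effect_size` is ours.

```r
# Relative change, in percent, between a first and a second value.
effect_size <- function(first, second) (second - first) / first * 100

data_effect   <- effect_size(18, 27.5)  # change in the data
visual_effect <- effect_size(0.6, 5.3)  # change in the drawn lengths
lie_factor    <- visual_effect / data_effect

round(c(data = data_effect, visual = visual_effect, lie_factor = lie_factor), 1)
```

A Lie factor close to 1 means that the graphic represents the change in the data faithfully.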

Lie factor : 9.4

Knowing that the “apple” area (in green) is equal to \(2.22\,cm^2\) and that the rim area (in blue) is equal to \(2.96\,cm^2\), compute the Lie factor.

Perception

\[S = I^p\]
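
This power law (known as Stevens' power law) relates the perceived sensation S to the physical intensity I. A small numerical sketch follows; the exponents are assumptions for illustration (roughly 1 for length, below 1 for area, as commonly quoted) and do not come from the slides.

```r
# Perceived sensation S for a physical intensity I and an exponent p.
perceived <- function(intensity, p) intensity^p

# Length-like channel (assumed p = 1): doubling is perceived as doubling.
perceived(2, p = 1.0)

# Area-like channel (assumed p = 0.7): doubling the area is perceived
# as less than a doubling.
perceived(2, p = 0.7)
```

An exponent below 1 means that differences encoded with that channel are systematically underestimated.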

Principle :

Increase the data density

\[\textrm{graph data density} = \frac{\textrm{number of entries in data matrix}}{\textrm{area of data display}} \]

Data density :

Avoid graphics with low data density

Edward Tufte, The Visual Display of Quantitative Information, 2nd ed., Cheshire, CT, Graphics Press, 2001 (1st ed. 1983)

Principle :

Increase the data-ink ratio

\[\textrm{data-ink ratio} = \frac{\textrm{area of data-ink}}{\textrm{total area of ink}}\]

Data-ink ratio :

Remove to improve

https://speakerdeck.com/cherdarchuk/remove-to-improve-the-data-ink-ratio

Data-ink ratio :

Remove to improve

https://www.youtube.com/watch?v=bDbJBWvonVI

Recap

  • Avoid misleading graphics !
  • Avoid empty graphics
  • Be parsimonious with ink
  • Scales ! (! colors, ! size)
  • Use explicit labels
  • ! categorical features and order
  • aspect ratio
  • file type: pdf, svg // png, jpg

ggplot

gg = grammar of graphics

  • “The Grammar of Graphics” (Wilkinson, Anand and Grossman, 2005)
  • grammar → same language for all figures

ggplot

building blocks of the grammar

  • the coordinate system
  • data and aesthetic mappings,
    e.g. f(data) → x position, y position, size, shape, color
  • geometric objects,
    e.g. points, lines, bars, texts
  • scales,
    e.g. f([0, 100]) → [0, 5] px
  • facet specification,
    e.g. split the data into several plots
  • statistical transformations,
    e.g. average, counting, regression
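
To make the building blocks concrete, here is a minimal sketch in which each line of a single plot is annotated with the block it belongs to. The choice of variables is ours.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +      # data + aesthetic mappings
  geom_point(aes(color = drv)) +                 # geometric object
  geom_smooth(method = "lm", formula = y ~ x) +  # statistical transformation
  scale_color_brewer(palette = "Set1") +         # scale: values -> colors
  facet_wrap(~ class) +                          # facet specification
  coord_cartesian()                              # coordinate system
```

`coord_cartesian()` is the default and is written out here only to show where the coordinate system fits.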

ggplot

Make a graphic :

  • add several layers
  • with their own visual encoding and possibly their own data
  • (+ optional) add statistical transformation
  • (+ optional) change scale options
  • (+ optional) specify title, theme, guides, style …


! data = tidy data.frame with the right feature types

ggplot, geometries

Make a graphic :

  • add several layers
    +geom_line()
  • with their own visual encoding and possibly their own data
    aes(x=a,y=b,...)

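Example

library(ggplot2)

# Aesthetic mappings set inside the layer
ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy, color = manufacturer, shape = factor(cyl)))
# Same mappings set globally; geom_jitter() adds noise to reduce overplotting
ggplot(mpg, aes(x = cty, y = hwy, color = manufacturer, shape = factor(cyl))) +
  geom_jitter()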

ggplot, scales

Make a graphic :

  • add several layers
    +geom_line()
  • with their own visual encoding and possibly their own data
    aes(x=a,y=b,...)
  • (+ optional) change scale options
    scale_fill_brewer(palette=3,type="qual")
    scale_x_continuous(limits=c(0,45),breaks=seq(0,45,2))
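
The scale options above can be tried on a full plot, as in the sketch below. Because points use the color aesthetic and not fill, it swaps `scale_fill_brewer()` for `scale_color_brewer()`; mapping `drv` to color is our choice.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = cty, y = hwy, color = drv)) +
  geom_jitter() +
  # x axis limited to [0, 45] with a tick every 2 units
  scale_x_continuous(limits = c(0, 45), breaks = seq(0, 45, 2)) +
  # third qualitative ColorBrewer palette for the categories
  scale_color_brewer(type = "qual", palette = 3)
```

Each aesthetic has its own family of scale functions, named `scale_<aesthetic>_<type>()`.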

Color scales

http://colorbrewer2.org/

ggplot, faceting

Make a graphic :

  • add several layers
    +geom_line()
  • with their own visual encoding and possibly their own data
    aes(x=a,y=b,...)
  • (+ optional) change scale options
    scale_fill_brewer(palette=3,type="qual")
    scale_x_continuous(limits=c(0,45),breaks=seq(0,45,2))
  • use facet ?
    facet_grid(. ~ cyl)

ggplot, stats

Make a graphic :

  • add several layers
    +geom_line()
  • with their own visual encoding and possibly their own data
    aes(x=a,y=b,...)
  • (+ optional) change scale options
    scale_fill_brewer(palette=3,type="qual")
    scale_x_continuous(limits=c(0,45),breaks=seq(0,45,2))
  • add statistics
    stat_density2d()
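
A statistical layer is added like any other layer. The sketch below overlays 2D density contours on the raw points; the variables are chosen by us for illustration.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(alpha = 0.3) +  # raw observations, semi-transparent
  stat_density2d()           # contours of a 2D kernel density estimate
```

The density estimate is computed by the stat when the plot is drawn, so the data frame itself does not need a density column.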

Sources

Exercises

Update the scale and labels

Exercises

Reproduce this graphic (Iris data)

Exercises

Reproduce this graphic (mtcars data). Hint: modify the plot theme, see ?theme

Exercises

Reproduce this graphic

Exercises

Reproduce this graphic. Information:
  • Bike sharing data from Lyon (data folder)
  • Compute the occupancy rate: nb bikes / max(nb bikes)
  • pivot to wide
  • run a k-means with 8 clusters on the matrix X (rows = stations, columns = time slots)
  • facet + mean curve + alpha blending
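
One possible way to chain these steps is sketched below on simulated data, since the Lyon files are not reproduced here. The column names `station`, `hour` and `nb_bikes` are hypothetical, and the real data will need its own loading and cleaning.

```r
library(ggplot2)
library(dplyr)
library(tidyr)

set.seed(1)
# Simulated stand-in for the Lyon data: 40 stations x 24 hourly slots.
bikes <- expand.grid(station = sprintf("s%02d", 1:40), hour = 0:23)
bikes$nb_bikes <- rpois(nrow(bikes),
                        lambda = 5 + 4 * sin(bikes$hour / 24 * 2 * pi))

# Occupancy rate per station: nb bikes / max(nb bikes).
bikes <- bikes %>%
  group_by(station) %>%
  mutate(rate = nb_bikes / max(nb_bikes)) %>%
  ungroup()

# Pivot to wide: one row per station, one column per time slot.
wide <- bikes %>%
  select(station, hour, rate) %>%
  pivot_wider(names_from = hour, values_from = rate, names_prefix = "h")
X <- as.matrix(wide[, -1])

# k-means with 8 clusters on the station profiles.
km <- kmeans(X, centers = 8, nstart = 10)
clusters <- data.frame(station = wide$station, cluster = factor(km$cluster))
bikes <- left_join(bikes, clusters, by = "station")

# One facet per cluster: station curves with alpha blending,
# plus the mean curve of the cluster in red.
p <- ggplot(bikes, aes(x = hour, y = rate)) +
  geom_line(aes(group = station), alpha = 0.2) +
  stat_summary(fun = mean, geom = "line", color = "red") +
  facet_wrap(~ cluster)
```

With real data the clusters should separate stations by usage profile; on this simulated data the profiles are nearly identical, so the clusters carry no meaning.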